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Utility Patent Application RP A 1 ; 74 1 6/78599 

TITLE OF THE INVENTION 

MODIFICATIONS IN THE MULTI-BAND EXCITATION 
(MBE) MODEL FOR GENERATING HIGH QUALITY 
SPEECH AT LOW BIT RATES 

CROSS REFERENCE TO RELATED APPLICATION 

This application claims the benefit of U.S. Provisional Application No. 60/161,681, filed 
October 26, 1999, 

FIELD OF THE INVENTION 

The invention relates to processing a speech signal. In particular, the invention relates 
to speech compression and speech coding. 

BACKGROUND OF THE INVENTION 

Compressing speech to low bit rates while maintaining high quality is an important 
problem, the solution to which has many applications, such as, for example, memory 
constrained systems. One compression scheme (coders) used to solve this problem is multi- 
band excitation (MBE), a scheme derived from sinusoidal coding. 

The MBE scheme involves use of a parametric model, which segments speech into 
frames. Then, for each segment of speech, excitation and system parameters are estimated. 
The excitation parameters include pitch frequency values, voiced/unvoiced decisions and the 
amount of voicing in case of voiced frames. The system parameters include spectral 
magnitude and spectral amplitude values, which are encoded based on whether the excitation 
is sinusoidal or harmonic. 

Though coders based on this model have been successful in sj^thesizing intelligible 
speech at low bit rates, they have not been successful in synthesizing high quality speech, 
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mainly because of incorrect parameter estimation. As a result, these coders have not been 
widely used. Some of the problems encountered are listed as follows. 

In the MBE model, parameters have a strong dependence on pitch frequency because 
all other parameters are estimated assuming that the pitch frequency has been accurately 
computed. 

Most sinusoidal coders, including the MBE based coders, depend on an accurate 
reproduction of the harmonic structure of spectra for voiced speech segments. Consequently, 
estimating the pitch frequency becomes important because harmonics are multiples of the 
pitch frequency. 

Another important aspect of the MBE scheme is the classification of a segment as 
voiced, unvoiced or silence segment. This is important because the three types of segments 
are represented differently and their representations have a different impact on the overall 
compression efficiency of the scheme. Previous schemes use inaccurate measures, such as 
zero-crossing rate and auto-correlation for these decisions. 

MBE based coders also suffer from undesirable perceptual effects arising out of 
saturation caused by unbalanced output waveforms. An absence of phase information in 
decoders in use causes the unbalance. 

PubUcations relevant to voice encoding include: McAulay et al., "Mid-Rate Coding 
based on a sinusoidal representation of speech", Proc. ICASSP85, pp.945-948, Tampa, Fla,, 
Mar. 26-29, 1985 (discusses the sinusoidal transform speech coder); Griffin, "Multi-band 
Excitation Vocoder", Ph.D. Thesis, M.LT, 1987, (Discusses the Multi-Band Excitation 
(MBE) speech model and an 8000 kbps MBE speech coder); SM. Thesis, M.LT, May 1988, 
(discusses a 4800 bps Multi-Band Excitation speech coder); McAulay et al., 
"Computationally efficient Sine- Wave Synthesis and its applications to Sinusoidal Transform 
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coding", Proc. ICASSP 88, New York , N.Y., pp. 370-373, April 1988, (discusses frequency 
domain voiced synthesis); D.W. Griffin, J.S. Lim, "Multi-band Excitation Vocoder," IEEE 
Trans. Acoust,, Speech, Signal Processing, vol. 36, pp. 1223-1235, August 1988; Tian Wang, 
Kun Tang, Chonxgi Feng "A high quality MBE-LPC-FE Speech coder at 2.4 kbps and 1.2 
kbps, Dept. of Electronic Engineering, Tsinghua University, Beijing, 100084, P.R. Chinna; 
Engin Erzin, Arun kumar and Allen Gersho "Natural quality variable-rate spectral speech 
coding below 3.0 kbps, Dept. of Electrical & Computer Eng., University of California, Santa 
Barbara, Ca, 93106 USA; INMARSAT M voice codec, Digital voice systems Inc. 1991, 
version 3.0 August 1991; A.M. Kondoz, Digital speech coding for low bit rate 
communication systems, John Wiley and Sons; Telecommunications Industry Association 
(TIA) "APCO project 25 Vocoder description" Version 1.3, July 15, 1993, IS102BABA 
(discusses 7.2 kbps IMBE speech coder for APCO project 25 standard); U.S. Pat. No. 
5,081,681 (discloses MBE random phase synthesis); Jayant et al.. Digital Coding of 
Waveforms, Prentice-Hall, 1984, (discussing the speech coding in general); U.S Patent No. 
4,885,790 (discloses sinusoidal processing method); Makhoul, ''A mixed-source model for 
speech compression and synthesis", IEEE (1978) ,pp. 163-166 ICASSP78; Griffin et al. 
"Signal estimation from modified short-time fourier transform", IEEE transactions on 
Acoustics, speech and signal processing, vol. ASSP-32, No. 2 , Apr. 1984, pp 236-243; 
Hardwick, "A 4.8 kbps multi-band excitation speech coder", S.M. Thesis, M.I.T., May 1988; 
P. Bhattacharya, M. Singhal and Sangeetha, "An analysis of the weaknesses of the MBE 
coding scheme," IEEE international conf on personal wireless communications, 1999; 
Almeida et al., "Harmonic coding: A low bit rate, good quality speech coding technique," 
IEEE (CH 1746-7/82/000 1684) pp. 1664-1667 (1982); Digital voice systems, hic. "The 
DVSI IMBE speech compression system," advertising brochure (May 12, 1993); Hardwick et 
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al., "The application of the IMBE speech coder to Mobile communications," IEEE (1991), 
pp.249-252 ICASSP 91 May 1991; Portnoff, "Short-time fourier analysis of samples speech", 
IEEE transactions on acoustics, speech and signal processing , vol. ASSP-29, No-3, Jun. 
1981, pp. 324-333; W.B Klein and K.K. Paliwal "Speech coding and synthesis"; Akaike H., 
"Power spectrum estimation through auto-regressive model fitting," Ann. Inst. Statist. Math., 
Vol. 21, pp. 407-419, 1969; Anderson, T.W., "The statistical analysis of time series," Wiley, 
1971; Durbin, J., "The fitting of time-series models," Rev. Inst. Int. Statist., Vol. 28, pp. 233- 
243, 1960; Makhoul J., "Linear Prediction: a tutorial review," Proc. IEEE, Vol. 63, pp. 561- 
580, April 1975; Kay S. M., "Modem spectral estimation: theory and application," Prentice 
Hall, 1988; Mohanty M., "Random signals estimation and identification," Van Nostrand 
Reinhold, 1986. The contents of these references are incorporated herein by reference. 

Various methods have been described for pitch tracking but each method has its 
respective Hmitations. In "Processing a speech signal with estimated pitch" (U.S. Patent No. 
5,226,108), Hardwick, et al. has described a sub-multiple check method for pitch, a pitch 
tracking algorithm for estimating a correct pitch frequency and a voiced/unvoiced decision of 
each band, which is based on an energy threshold value. 

In "Voiced/unvoiced estimation of an acoustic signal" (U.S. Patent No. 5,216,747), 
Hardwick et al. has described a method for estimating voiced/unvoiced classifications for 
each band. The estimation, however, is based on a threshold value, which depends upon the 
pitch and the center fi-equency of each band. Similarly, in INMARSAT M voice codec 
(Digital voice systems Inc. 1991, version 3.0 August 1991) the voiced/unvoiced decision for 
each band depends upon threshold values which in turn depend upon the energy of current 
and previous fi*ames. Occasionally, these parameters are not updated well, which results in 
incorrect decisions for some bands and a deteriorated output speech quality. 
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In "Synthesis of MBE based coded speech using regenerated phase information" 
(U.S. Patent No. 5,701,390), Griffin et al. has described a method for generating a voiced 
component phase in speech synthesis. The phase is estimated from a spectral envelope of the 
voiced component (e.g. from the shape of the spectral envelope in the vicinity of the voiced 
component). The decoder reconstructs the spectral envelope and voicing information for 
each of a plurality of frames. The voicing information is used to determine whether frequency 
bands for a particular spectrum are voiced or unvoiced. Speech components for voiced 
frequency bands are synthesized using the regenerated spectral phase information. 
Components for unvoiced frequency bands are generated using other techniques. 

The discussed methods do not provide solutions to the problems described above. 
The invention presents solutions to these problems and provides significant improvements to 
the quality of MBE based speech compression algorithms. For example, the invention 
presents a novel method for reducing the complexity of unvoiced synthesis at the decoder. It 
also describes a scheme for making the voiced/unvoiced decision for each band and 
computing a single Voicing Parameter, which is used to identify a transition point from a 
voiced to an unvoiced region in the spectrum; Compact spectral amplitude representation is 
also described. 

BRIEF SUMMARY OF THE INVENTION 

The invention includes methods to improve the estimation of parameters associated 
with the MBE model, methods that reduce the complexity of certain modules, and methods 
that facilitate the compact representation of parameters. 

For example, one aspect of the invention relates to an improved pitch-tracking method 
to estimate pitch with greater accuracy. Pursuant to a first method that incorporates 
principles of the invention, five potential pitch candidates from each of a past, a current and a 
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future frame are considered and a best path is traced to determine a correct pitch for the 
current frame. Moreover, pursuant to the first method, an improved sub-multiple checks 
algorithm, v^hich checks for multiples of pitch and eliminates the multiples based on 
heuristics may be used. 

Another aspect of the invention features a novel method for classifying active speech. 
This method, vv^hich is based on a number of parameters, determines whether a current frame 
is silence, voiced or unvoiced. The frame infotmation is collected at different points in an 
encoder, and a final silence-voiced-unvoiced decision is made based on the cumulative 
information collected. 

Another aspect of the invention features a method for estimating voiced/unvoiced 
decisions for each band of a spectrum and for determining a voice parameter (VP) value. 
Pursuant to a second method that incorporates principles of the invention, the voicing 
parameter is determined by finding an appropriate transition threshold, which indicates the 
amount of voicing present in a frame. Pursuant to the second method, the voiced/unvoiced 
decision is made for each band of harmonics with a single band comprising three harmonics. 
For each band a spectrum is synthesized twice: first assuming all the harmonics are voiced, 
and again assuming all the harmonics are unvoiced. An error for each synthesized spectra is 
obtained by comparing the respective synthesized spectrum with the original spectrum over 
each band. If the voiced error is less than the unvoiced error, the band is marked voiced, 
otherwise it is marked unvoiced. 

Another aspect of the invention features an improved imvoiced synthesis method that 
reduces the amount of computation required to perform unvoiced synthesis, without 
compromising quality, hastead of generating a time domain random sequence and then 
performing an FFT to generate random phases for unvoiced spectral amplitudes like earlier 
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described methods, a third method that incorporates principles of the invention directly uses a 
random generator to generate random phases for the estimated unvoiced spectral amplitudes. 

Another aspect of the invention features a method to balance an output speech 
waveform and smoothen undesired perceptual artifacts. Generally, if phase information is 
not sent to a decoder, the generated output waveform is unbalanced and will lead to 
noticeable distortions when the input level is high, due to saturation. Pursuant to a fourth 
method that incorporates principles of the invention, harmonic phases are initialized with a 
fixed set of values during transitions from unvoiced frames to voiced frames. These phases 
may be updated over successive voiced frames to maintain continuity. 

In another aspect of the invention, a linear prediction technique is used to model 
spectral amplitudes. A spectral envelope contains magnitudes of all harmonics in the frame. 
Encoding these amplitudes requires a large number of bits. Because the number of 
harmonics depends on the fimdamental frequency, the number of spectral amplitudes varies 
from frame to frame. It is more practical, therefore, to quantize the general shape of the 
spectrum, which can be assumed to be independent of the fundamental frequency. As a 
result, these spectral amplitudes are modeled using a linear prediction technique, which helps 
reduce the number of bits required for representing the spectral amplitudes. The LP 
coefficients are mapped to corresponding Line Spectral Pairs (LSP) which are then quantized 
using multi-stage vector quantization, each stage quantizing the residual of the previous one. 

In another aspect of the invention, a voicing parameter (VP) is used to reduce the 
number of bits required to transmit voicing decisions of all bands. The VP denotes a band 
threshold, under which all bands are declared unvoiced and above which all bands are 
marked voiced. Instead of a set of decisions, a single VP is now transmitted. 
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In another aspect of the invention, a fixed pitch frequency is assumed for all unvoiced 
frames and all the harmonic magnitudes are computed by taking the root mean square value 
of the frequency spectrum over desired regions. 

BRIEF DESCRIPTION OF THE DRAWINGS 

Further objects of the invention, taken together with additional features contributing 
thereto and advantages occurring therefrom, will be apparent from the following description 
of the invention when read in conjunction with the accompanying drawings, wherein: 

Figure 1 is a block diagram of an MBE encoder that incorporates principles of the 
invention; 

Figure 2 is a block diagram of an MBE decoder that incorporates principles of the 
invention; 

Figure 3 is a block diagram that depicts an exemplary voicing parameter estimation 
method pursuant to an aspect of the invention; and 

Figure 4 is a block diagram that depicts a descriptive unvoiced speech synthesis 
method pursuant to an aspect of the invention. 

DETAILED DESCRIPTION OF THE INVENTION: 

While the invention is susceptible to use in various embodiments and methods, there 
is shown in the drawings and will hereinafter be described specific embodiments and 
methods with the understanding that the disclosure is to be considered an exemphfication of 
the invention and is not intended to limit the invention to the specific embodiments and 
methods illustrated. 

This invention relates to a low bit rate speech coder designed as a variable bit rate 
coder based on the Multi Band Excitation (MBE) technique of speech coding. 
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A block diagram of an encoder that incorporates aspects of the invention is depicted 
in Figure 1. The depicted encoder performs various functions including, for example, 
analysis of an input speech signal, parameterization and quantization of parameters. 

In the analysis stage of the encoder, the input speech is passed through block 100 to 
high-pass filter the signal to improve pitch detection, for situations where samples are 
received through a telephone channel. The output of block 100 is passed to a voice activity 
detection module, block 101. This block performs a first level active speech classification, 
classifying firames as voiced and voiceless. The frames classified voiced by block 101 are 
sent to block 102 for coarse pitch estimation. The voiceless frames are passed directly to 
block 105 for spectral amplitude estimation. 

During coarse pitch estimation (block 102), a synthetic speech spectrum is generated 
for each pitch period at half sample accuracy, and the synthetic spectrum is then compared 
with the original spectrum. Based on the closeness of the match, an appropriate pitch period 
is selected. The coarse pitch is obtained and further refined to quarter sample accuracy in 
block 103 by following a procedure similar to the one used in coarse pitch estimation. 
However, during quarter sample refinement, the deviation is measured only for higher 
frequencies and only for pitch candidates around the coarse pitch. 

Based on the pitch estimated in block 103, the current spectrum is divided into bands 
and a voiced/unvoiced decision is made for each band of harmonics in block 104 (a single 
band comprises three harmonics). For each band, a spectrum is synthesized, first assuming 
all the harmonics in the band are voiced, and then assuming all the harmonics in the band are 
unvoiced. An error for each synthesized spectra is obtained by comparing the respective 
synthesized spectrum with the original spectrum over each band. If the voiced error is less 
than the unvoiced error, the band is marked voiced, otherwise it is marked unvoiced. 
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In order to reduce the number of bits required to transmit the voicing decisions found 
in block 104, a Voicing Parameter (VP) is introduced. The VP denotes the band threshold, 
under which all bands are declared xmvoiced and above which all bands are marked voiced. 
Instead of a set of decisions, a single VP is calculated in block 107. 

Speech spectral amplitudes are estimated by generating a synthetic speech spectrum 
and comparing it with the original spectrum over a frame. The synthetic speech spectrum of 
a frame is generated so that distortion between the synthetic spectrum and the original 
spectrum is minimized in a sub-optimal manner in block 105. 

Spectral magnitudes are computed differently for voiced and unvoiced harmonics. 
Unvoiced harmonics are represented by the root mean square value of speech in each 
unvoiced harmonic frequency region. Voiced harmonics, on the other hand, are represented 
by synthetic harmonic amplitudes, which accurately characterize the original spectral 
envelope for voiced speech. 

The spectral envelope contains magnitudes of each harmonic present in the frame. 
Encoding these amplitudes requires a large number of bits. Because the number of 
harmonics depends on the fundamental frequency, the number of spectral amplitudes varies 
from frame to frame. Consequently, the spectrum is quantized assuming it is independent of 
the fimdamental frequency, and modeled using a linear prediction technique in blocks 106 
and 108. This helps reduce the number of bits required to represent the spectral amplitudes. 
LP coefficients are then mapped to corresponding Line Spectral Pairs (LSP) in block 109, 
which are then quantized using multi-stage vector quantization. The residual of each 
quantizing stage is quantized in a subsequent stage in block 110. 

The block diagram of a decoder that incorporates aspects of the invention is illustrated 
in Figure 2. Parameters from the encoder are first decoded in block 200. A synthetic speech 

10 



utility Patent Apphcation l;7416/78599 

Spectrum is then reconstracted using decoded parameters, including a fundamental frequency 
value, spectral envelope information and voiced/unvoiced characteristics of the harmonics. 
Speech synthesis is performed differently for voiced and unvoiced components and 
consequently depends on the voiced/unvoiced decision of each band. Voiced portions are 
synthesized in the time domain whereas unvoiced portions are synthesized in the frequency 
domain. 

The spectral shape vector (SSV) is determined by performing a LSF to LPC 
conversion in block 201. Then using the LPC gain and LPC values computed during the LSF 
to LPC conversion (block 201), a SSV is computed in block 202. The SSV is spectrally 
enhanced in block 203 and inputted into block 204. The pitch and VP from the decoded 
stream are also inputted into block 204. In block 204, based on the voiced/unvoiced decision, 
a voiced or unvoiced synthesis is carried out in blocks 206 or 205, respectively. 

An unvoiced component of speech is generated from harmonics that are declared 
unvoiced. Spectral magnitudes of these harmonics are each allotted a random phase 
generated by a random phase generator to form a modified noise spectrum. The inverse 
transform of the modified spectrum corresponds to an unvoiced part of the speech. 

Voiced speech represented by individual harmonics in the frequency domain is 
synthesized using sinusoidal waves. The sinusoidal waves are defined by their amphtude, 
frequency and phase, which were assigned to each harmonic in the voiced region. 

The phase information of the harmonics is not conveyed to the decoder. Therefore, in 
the decoder, at transitions from an unvoiced to a voiced frame, a fixed set of initial phases 
having a set pattem is used. Continuity of the phases is then maintained over the frames. Li 
order to prevent discontinuities at edges of the frame due to variations in the parameters of 
adjacent frames, both the current and previous frame's parameters are considered. This 
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ensures smooth transitions at boundaries. The two components are then finally combined to 
produce a complete speech signal by conversion into PCM samples in block 207. 

Most sinusoidal coders, including the MBE vocoder, crucially depend on accurately 
reproducing the harmonic structure of spectra for voiced speech segments. Since harmonics 
are merely multiples of the pitch frequency, the pitch parameter assumes a central role in the 
MBE scheme. As a result, other parameters in the MBE coder are dependent on the accurate 
estimation of the pitch period. 

Although there have been many pitch estimation algorithms, each one has its own 
limitation. Deviations between the pitch estimates of consecutive frames are bound to occur 
and these errors produce artifacts, which are essentially perceived. Therefore, in order to 
improve the pitch estimate by preventing abrupt changes in the pitch trajectory, a good 
tracking algorithm that ensures consistent pitch estimates of consecutive frames is required. 
Further, in order to remove the pitch doubling and tripling errors, a sub-multiple check 
algorithm, which supplements the pitch tracking algorithm, is required. Thus, ensuring 
correct pitch estimation in a frame. 

In the MBE scheme of the INMARSAT M voice codec (Digital voice systems Inc. 
1991, version 3.0 August 1991), the pitch tracking module used attempts to improve a pitch 
estimate by limiting the pitch deviation between consecutive frames, as follows: 

In the INMARSAT M voice codec, an error function, E(P), which is a measure of 
spectral error between the original and S5nithesized spectrum and which assumes harmonic 
structure at intervals corresponding to a pitch period (P) is calculated. If the criterion for 
selecting pitch were based strictly on error minimization of a current frame, the pitch estimate 
may change abruptly between succeeding frames, causing audible degradation in synthesized 
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speech. Hence, two previous and two future frames are considered while tracking in the 
INMARSAT M voice codec. 

For each speech frame, two different pitch estimates are computed: (1) the backward 
pitch estimate calculated using look-back tracking, and (2) the forward pitch estimate 
calculated using look-ahead tracking. 

The look-back tracking algorithm of the INMARSAT M voice codec uses information 
from two previous frames. P-2 and P.] denote initial pitch estimates calculated during 
analysis of the two previous frames, respectively, and E-2(P-2) and E-i(P_i) denote their 
corresponding error functions. 

In order to find Po, an error function E(Po) is evaluated for each pitch candidate falling 
in the range: 



The Po value corresponding to the minimum error (E(Po)) is selected as the backward pitch 
estimate (Pb), and the cumulative backward error (CEb) is calculated using the equation: 



Look-ahead tracking attempts to preserve continuity between future speech frames. 
Since pitch has not been determined for the two future frames being considered, the look- 
ahead pitch tracking of the INMARSAT M voice codec selects pitch for these frames, Pi and 
P2, after assuming a value for Pq. Pitch is selected for Pj so that Pi belongs to {21, 
21. 5. ...114}, and pursuant to the relationship: 



0.8Ri<=P0 <=L2P-i. 



(1) 



CEb(Pb) - E(Pb) + E.i(P.i) + E_2(P.2). 



(2) 



0.8Po<-Pi <=1.2 Po 



(3) 



Pitch is selected for P2 so that P2 belongs to {21 ,2 1 .5 114}, and pursuant to the 



relationship: 



0.8Pi<=P2<=L2 Pi. 



(4) 
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Pi and P2 are selected so their combined errors [Ei(Pi) + E2(P2)] are minimized. 

The cumulative forward error is then calculated pursuant to the equation: 
CEp(Po) = E(Po) + Ei(Pi) + E2(P2)- (5) 

The process is repeated for each Po in the set (21, 21.5, ... 114), and the Po value 
corresponding to a minimum cumulative forward error CEf(Po) is selected as the forward 
pitch estimate. 

Once Po is determined, the integer sub-multiples of Po (i-e. Po/2, Po/3, ...Po/n) are 
considered. Every sub-multiple, which is greater than or equal to 21 is computed and replaced 
with the closest half sample. The smallest of these sub-multiples is applied to constraint 
equations. If the sub-multiple satisfies the constraint equations, then that value is selected as 
the forward pitch estimate Pp. This process continues until all the sub-multiples, in ascending 
order, have been tested against the constraint equations. If no sub-multiple satisfies these 
constraints, 
then Pf = Po* 

The forward pitch estimate is then used to compute the forward cumulative error as 
follows: 

CEf(Pf) = E(Pf) + Ei(Pi) + E2(P2) (6) 

Next, the forward cumulative error is compared against the backward cumulative 
error using a set of heuristics. This comparison determines whether the forward pitch 
estimate or the backward pitch estimate is selected as the initial pitch estimate for the current 
frame. 

The discussed algorithm of the INMARSAT M voice codec requires information from 
two previous frames and two fixture frames to determine the pitch estimate of a current frame. 
This means that in order to estimate the pitch of a current frame, a two fiiture frame wait is 
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required. This increases algorithmic delay in the encoder. The algorithm of the INMARSAT 
M voice codes is also computationally expensive. 

An illustrative pitch tracking method, pursuant to an aspect of the invention, that 
circumvents these problems and improves performance is described below. 

Pursuant to the invention, the illustrative pitch tracking method is based on the 
closeness of a spectral match between the original and the synthesized spectrum for different 
pitch periods, and thus exploits the fact that the correct pitch period corresponds to a minimal 
spectral error. 

In the illustrative pitch tracking method, five pitch values of the current fi-ame which 
have the least errors (E(P)) associated with them are considered for tracking since the pitch of 
the current frame will most likely be one of the values in this set. Five pitch values of a 
previous frame, which have the least errors associated with them, and five pitch values of a 
future frame, which have the least error (E(P)) associated with them, are also selected for 
tracking. 

All possible paths are then traced through a treUis that includes the five pitch values 
corresponding to five E(P) minima of the previous fi-ame in a first stage, five pitch values 
corresponding to five E(P) minima of the current frame in a second stage, and five pitch 
values corresponding to five E(P) minima of the fiiture frame in a third stage. A cumulative 
error fimction, called the Cost Function (CF), is evaluated for each path: 
CF = k * (E-i + E.k) + log(P.]/P.k) + k * (E.k + Kj) + log(P.k / P-j). (7) 

CF is the total error defined over a frajectory. P.i, is a selected pitch value for the 
previous frame, P.k is a selected pitch value for the current frame, and P.j is a selected pitch 
value for a future frame, E.i is an error value for P.i, E.k is an error value for P.k, E.j is an error 
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value for P.j, and k is a penalizing factor that has been tuned for optimal performance. The 
path having the minimum CF value is selected. 

Depending on the type of previous and future frames, different cases arise, each of 
which are treated differently. If the previous frame is unvoiced or silence, then the previous 
frame is ignored and paths are traced between pitch values of the current frame and the future 
frame. Similarly, if the future frame is not voiced, then only the previous frame and current 
frame are taken into consideration for tracking. 

By using pitch values lying in the path of minimum error, backward and forward pitch 
estimates can be computed with which the initial pitch estimate of the current frame can be 
evaluated, as explained below. 

For the illustrative pitch tracking method, let Po denote the pitch of the current frame 
lying in the least error path and E(Po) denote the associated error function. 

Once Po is determined, the integer sub-multiples of Po (i.e. Po/2, Po/3, ...Po/n) are 
considered. Every sub-multiple, which is greater than or equal to 21 is computed and 
replaced with the closest half sample. The smallest of these sub-multiples is checked with 
backward constraint equations. If the sub-multiple satisfies the backward constraint 
equations, then that value is selected as the backward pitch estimate Pb^ This process 
continues until all the sub-multiples, in ascending order, have been tested by the backward 
constraint equations. If no sub-multiple satisfies the backward constraint equations, then Pq 
is selected as the backward pitch estimate (Pb = Po). 

The backward pitch estimate is then used to compute the backward cumulative error 
by applying the equation: 

CEb(Pb)-E(Pb) + E.i(Ri). (8) 
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To calculate the forward pitch estimate, according to the illustrative pitch tracking 
method, a sub-multiple check is performed and checked with forward constraint equations. 
Examples of acceptable forward constraint equations are hsted below. 
CEF(Po/n) < 0.85 and CEF(Po/n)/CEF(Po) < 1*7 (9) 
CEF(Po/n) < 0.4 and CEF(Po/n)/CEF(Po) < 3.5 (10) 
CEF(Po/n) < 0.5 (11) 

The smallest sub-multiple which satisfies the forward constraint equations is selected 
as the forward pitch estimate Pp. If a sub-multiple does not satisfy the forward constraint 
equations, Pq is selected as the forward pitch estimate (Pf = Po). 

The forward pitch estimate is then used to calculate the forward cumulative error by 
applying the equation: 

CEf(Pf) = E(Pf) + Ki(P.i) (12) 

Pursuant to the illustrated pitch tracking method, the forward and backward 
cumulative errors are then compared with one another based on a set of decision rules, 
depending on which estimate is selected as the initial pitch candidate for the current frame. 

The illustrated pitch tracking method, which incorporates principles of the invention, 
addresses a number of shortcomings prevalent in tracking algorithms in use. First, the 
illustrated method uses a single frame look-ahead compared to a two frame look-ahead, and 
thus reduces algorithmic delay. Moreover, it can use a sub-multiple check for backward pitch 
estimation, thus increasing pitch estimate accuracy. Further, it reduces computational 
complexity by using only five pitch values per selected frame. 

A speech signal comprises of silence, voiced segments and unvoiced segments. Each 
speech signal category requires different types of information for accurate reproduction 
during the synthesis phase. Voice segments require information regarding fundamental 
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frequency, degree of voicing in the segment and spectral amplitudes. Unvoiced segments, on 
the other hand, require information regarding spectral amplitudes for natural reproduction. 
This apphes to silence segments as welh 

A speech classifier module is used to provide a variable bit rate coder, and, in 
general, to reduce the overall bit rate of the coder. The speech classifier module reduces the 
overall bit rate by reducing the number of bits used to encode unvoiced and silence frames 
compared to voiced frames. 

Coders in use have employed voice activity detection (VAD) and active speech 
classification (ASC) modules separately. These modules are based on characteristics such as 
zero crossing rate, autocorrelation coefficients and so on. 

A descriptive speech classifier method, which incorporates principles of the invention, 
is described below. The described speech classifier method uses several characteristics of a 
speech frame before making a speech classification. Thus the classification of the descriptive 
method is accurate. 

The described speech classifier method performs speech classification in three steps. 
In the first step, an energy level is used to classify frames as voiced or voiceless at a gross 
level. The base noise energy level of the frames is tracked and the minimum noise level 
encountered corresponds to a background noise level. 

Pursuant to the descriptive speech classifier method, energy in the 60-1000 Hz band is 
determined and used to calculate the ratio of the determined energy to the base noise energy 
level. The ratio can be compared with a threshold derived from heuristics, which threshold is 
obtained after testing over a set of 15000 frames having different background noise energy 
levels. If the ratio is less than the threshold, the frame is marked unvoiced, otherwise it is 
marked voiced. 
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The threshold is biased towards voiced frames, and thus ensures voiced frames are not 
marked unvoiced. As a result, unvoiced frames may be marked voiced, hi order to correct 
this, a second detailed step of classification is carried out which acts as an active speech 
classifier and marks frames as voiced or unvoiced. The frames marked voiced in the previous 
step are passed through this module for more accurate classification. 

Pursuant to the descriptive speech classifier method, voiced and unvoiced bands are 
classified in the second classification step module. This module determines the amount of 
voicing present at a band level and a frame level by dividing a spectrum of a frame into 
several bands, where each band contains three harmonics. Band division is based on the 
pitch frequency of the frame. The original spectrum of each band is then compared with a 
synthesized spectrum that assumes harmonic structure. A voiced and imvoiced band decision 
is made on the comparison. If the match is close, the band is declared voiced, otherwise it is 
marked unvoiced. At the frame level, if all the bands are marked unvoiced, the frame is 
declared unvoiced, otherwise it is declared voiced. 

To distinguish silence frames from unvoiced frames, in the descriptive speech 
classifier method, a third step of classification is employed where the frame's energy is 
computed and compared with an empirical threshold value. If the frame energy is less than 
the threshold, the frame is marked silence, otherwise it is marked unvoiced. The descriptive 
speech classifier method makes use of the three steps discussed above to accurately classify 
silence, unvoiced and voiced frames. 

In summary, the descriptive speech classifier method uses multiple measures to 
improve Voice Activity Detection (VAD). In particular, it uses spectral error as a criterion 
for determining whether a frame is voiced or unvoiced. This is very accurate. The method 
also uses an existing voiced-unvoiced band decision module for this purpose, thus reducing 
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computation. Further, it uses a band energy-tracking algorithm in the first phase, making the 
algorithm robust to background noise conditions. 

In the multi-band excitation (MBE) model, a single voiced-unvoiced classification of 
a classical vocoder is replaced by a set of voiced-unvoiced decisions taken over harmonic 
intervals in the frequency domain. Li order to obtain natural quality speech, it is imperative 
that these band voicing decisions are accurate. The band voicing classification algorithm 
involves dividing the spectrum of the frame into a number of bands, wherein each band 
contains three harmonics. The band division is performed based on the pitch frequency of the 
frame. The original spectrum of each band is then compared with a spectrum that assumes 
harmonic structure. Finally, the normalized squared error between the original and the 
synthesized spectrum over each band is computed and compared with the energy dependent 
threshold value and declared voiced if the error is less than the threshold value, otherwise it is 
declared voiced. The voicing parameter algorithm, which has been used in the INMARSAT 
M voice codec (Digital voice systems Inc. 1991, version 3.0 August 1991) rehes on frame 
energy change, the updation of which is not up to standards, for its threshold. 

In other algorithms, errors occurring in the voiced/unvoiced band classification can be 
characterized in two different ways: (a) coarse and fine, and (b) Voiced classification as 
unvoiced and vice versa. 

The frame, as a whole, can be wrongly classified, in which case the error is 
characterized as a coarse error. Sudden surges or dips in the voicing parameter also come 
under this category. If the error is restricted to one or more bands of a frame then the error is 
characterized as a fine error. The coarse and fine errors are perceptually distinguishable. 
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A voicing error can also occur as a result of a voiced band marked unvoiced or an 
unvoiced band marked voiced. Either of these errors can be coarse or fine, and are audibly 
distinct. 

A coarse error spans over an entire frame and results in each voiced band being 
marked unvoiced, the production of unwanted clicks, and if the error persists over a few 
frames, the introduction of one type of hoarseness into the decoded speech. Coarse errors 
that involve unvoiced bands of a frame being inaccurately classified as voiced cause phantom 
tone generation, which produces a ringy effect in the decoded speech. If this error occurs 
over two or more consecutive frames, the ringy effect becomes very pronounced, further 
deteriorating decoded speech quality. 

On the other hand, fine errors that are biased towards unvoicing over a set of frames 
introduce a husky effect into the decoded speech while those biased towards voicing result in 
overvoicing, thus producing a tonal quality in the output speech. 

An exemplary voicing parameter (VP) estimation method that incorporates principles 
of the invention is described below. The exemplary VP estimation method is independent of 
energy threshold values. Pursuant to the exemplary method, the complete spectrum is 
synthesized assuming each band is unvoiced, i.e. each point in the spectrum over a desired 
region is replaced by the root mean square (r.m.s) value of spectrum amplitude over that 
band. The same spectrum is also synthesized assuming each band is voiced, Le. a harmonic 
structure is imposed over each band using a pitch frequency. But, when imposing the 
harmonic structure over each band, it is assured that a valley between two consecutive 
harmonics is not below an actual valley of corresponding harmonics in the original spectrum. 
This is achieved by clipping each synthesized valley amphtude to a minimum value of the 
original spectrum between the corresponding two consecutive harmonics. 
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Next, in the exemplary VP estimation method, the mean square error over each band 
for both spectrums is computed from the original spectrum. If the error between the original 
spectrum and the synthesized spectrum that assumes an unvoiced band is less than the error 
between the original spectrum and synthesized spectrum that assumes a voiced band 
(harmonic structure over that band), the band is declared unvoiced, otherwise it is declared 
voiced. The same process is repeated for the remaining bands to get the voiced-unvoiced 
decisions for each band. 

Figure 3 shows a block diagram of the exemplary VP estimation method, hi block 
300, the entire spectrum is synthesized for each harmonic assuming each harmonic is voiced. 
The spectrum is synthesized using pitch frequency and actual spectrum information for the 
frame. The complete harmonic structure is generated by using the pitch frequency and 
centrally placing the standard Hamming window of required resolution around actual 
harmonic ampUtudes. Block 301 represents the complete spectrum (i.e. the fixed point FFT) 
of the original input speech signal. 

In block 302, the entire spectrum is synthesized for each harmonic assuming each 
harmonic is unvoiced. The complete spectrum is synthesized using the root mean square 
(r.m.s) value for each band over that region in the actual spectrum. Thus, the complete 
spectrum is synthesized by replacing actual spectrum values in that region by the r.m.s value 
in that band. In block 303, valley compensation between two successive harmonics is used 
to ensure that the synthesized valley amplitude between corresponding successive harmonics 
is not less than the actual valley amplitude between corresponding harmonics. In block 304, 
the mean square error is computed over each band between the actual spectrum and the 
synthesized spectrum assuming each harmonic is voiced. In block 305, the mean square error 
is computed over each band between the actual spectrum and the synthesized spectrum 
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assuming each harmonic is unvoiced (each band is replaced by its r.m.s. value over that 
region). In block 306, the unvoiced error for each band is compared with the voiced error for 
each band; The voiced-unvoiced decision is determined for each band by selecting the band 
decision having minimum error in block 307. 

For the exemplary VP estimation method, let Sorg(m) be the original frequency 
spectrum of a frame, and let Ssyn,h(m,Wo) be the synthesized spectrum of the frame that 
assumes a harmonic structure over the entire spectrum and that uses a fundamental frequency. 
Wo . The fundamental frequency Wo is used to compute the error from the original spectrum 

Sorg{m) . 

Let Ssrmsim) be the synthesized spectrum of the current frame that assumes an 
unvoiced frame. Spectrum points are replaced by the root mean square values of the original 
spectrum over that band (each band contains three harmonics except the last band, which 
contains the remaining number of the total harmonics). 

Let erroruv(k) be the mean squared error over the k* band between the frequency 
spectrum ( Sorg{m) ) and the spectrum that assumes an unvoiced frame ( Ssrms{m) ). 
errorur(k) = ((Sorgim) - Srmsim)) * iSorg(m) - Srmsim))) I N (13) 
N is the total number of points used over that region to compute the mean square error. 

Similarly, let errorvoiced(k) be the mean squared error over the k* band between the 
frequency spectrum Sorgim) and the spectrum that assumes a harmonic structure 

{Ssynth{m,'Wo)). 

errOrvo,ced{k) = ((Sorg(m) - Ssyntkim)) * (Sorgim) - Ssynth(m))) I N (14) 

Pursuant to the exemplary VP estimation method, the k* band is declared voiced if 
the error.oiced{k) is less than the erroruv{k) over that region, otherwise the band is declared 
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unvoiced. Similarly, each band is checked to determine the voiced-unvoiced decisions for 



Pursuant to an illustrative Voicing Parameter (VP) threshold estimation method that 
incorporates principles of the invention, a VP is introduced to reduce the number of bits 
required to transmit voicing decisions for each band. The VP denotes a band threshold, under 
which all bands are declared tmvoiced and above which all bands are marked voiced. Hence, 
instead of a set of decisions, a single VP can be transmitted. Experimental results have 
proved that if the threshold is determined correctly, there will be no perceivable deterioration 
in decoded speech quality. 

The illustrative voicing parameter (VP) threshold estimation method uses a VP for 
which the hamming distance between the original and the synthesized band voicing bit 
strings is minimized. As a further extension, the number of voiced bands marked unvoiced 
and that of unvoiced bands marked voiced can be penalized differentially to conveniently 
provide a biasing towards either. Pursuant to the illustrative VP threshold estimation method, 
the final form of the weighted bit error for a band threshold at the k^^ band is given by: 



ai, i = l,....,m are the original binary band decisions and Cv is a constant that governs 
differential penalization. This removes sudden transitions from the voicing parameter. 

In sum, degradation in decoded speech quahty due to errors in VP estimation have 
been minimized using the illustrative VP threshold estimation method. Most problems 
inherent in previous voiced-unvoiced band classifications used in the INMARSAT M voice 



each band. 



k 



m 




(15) 
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codec (Digital voice systems Inc. 1991, version 3.0 August 1991) have also been eliminated 
by replacing the previous module by the exemplary voicing parameter estimation method and 
the illustrative voicing parameter (VP) threshold estimation method, which also improves 
decoded speech quality. 

In an MBE based decoder, voiced and unvoiced speech synthesis is done separately, 
and unvoiced synthesized speech and voiced synthesized speech is combined to produce 
complete synthesized speech. Voiced speech synthesis is done using standard sinusoidal 
coding, while unvoiced speech synthesis is done in the frequency domain. In the 
INMARSAT M voice codec (Digital voice systems Inc. 1991, version 3.0 August 1991), to 
generate unvoiced speech, a random noise sequence of specific length is initially generated 
and its Fourier transform is taken to generate a complete imvoiced spectrum. Then, the 
spectrum amplitudes of a random noise sequence are replaced by actual unvoiced spectral 
ampUtudes, keeping phase values equal to those of the random noise sequence spectrum. The 
rest of the amphtude values are set to zero. As a result, the unvoiced spectral amplitudes 
remain unchanged but their phase values are replaced by the actual phases of the random 
noise sequence. 

Later, the inverse Fourier transform of the modified unvoiced spectrum is taken to get 
the desired unvoiced speech. Finally, the weighted overlap method is applied to get the 
actual unvoiced samples using the current and previous unvoiced speech samples using a 
standard synthesis window of desired length. 

The unvoiced speech synthesis algorithm used in the INMARSAT M voice codec is 
computationally complex and involves both Fourier and inverse Fourier transforms of the 
random noise sequence and modified unvoiced speech spectrum. A descriptive unvoiced 
speech synthesis method that incorporates principles of the invention is described below. 
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The descriptive unvoiced speech synthesis method only involves one Fourier 
transform, and consequently reduces the computational complexity of unvoiced synthesis by 
one-half with respect to the algorithm employed in the INMARSAT M voice codec (Digital 
voice systems hic. 1991, version 3.0 August 1991). 

Initially, pursuant to the descriptive unvoiced speech synthesis method, a random 
noise sequence of desired length is generated and, later, each generated random value is 
transformed to get random phases, which are uniformly distributed between negative n and n. 
Then, random phases are assigned to an actual unvoiced spectral amplitude to get a modified 
unvoiced speech spectrum. Finally, the inverse Fourier transform is taken for the unvoiced 
speech spectrum to get a desired unvoiced speech signal. However, since the length of the 
synthesis window is longer than the frame size, the unvoiced speech for each segment 
overlaps the previous frame. A weighted Overiap Add method is apphed to average these 
sequences in the overlapping regions. 

Let U(n) be the sequence of random numbers, which are generated using the 

equation: 

C/(n + l) = 171*i7(/2) + 11213-53125*L(171*?7(n) + 11213)/53125j (16) 
L J represent the integer part of the fractional number, and U{0) is initially set to 3 147. 
Alternatively, the randomness in the unvoiced spectrum may be provided by using a different 
random noise generator. This is within the scope of this invention. 

Pursuant to the descriptive unvoied speech synthesis method, each random noise 
sequence value is computed from equation 16 and, later, each random value is transformed 
between negative ;rand ;r. Let Samp(l) be the amplitude of the harmonic. The random 
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phases are assigned to the actual spectral amplitudes, and the modified unvoiced spectrum 
over the 1* harmonic region is given by: 

U^im) = Samp{l) * (cos(<^) + 7 sin(^?>)) (17) 
(f> is the random phase assigned to the 1* harmonic. 

Last, the inverse Fourier transform is taken for C/w(m) to get the unvoiced signal in 
the time domain using the equation: 

m=NI2-\ 

u{n)^\IN J^C/(m)exp((7=^2=^;r*m*w)/A^) For NI2<n<NI2-\ (18) 

N is the number of FFT points used for inverse computation. 

Later, to get the actual unvoiced portion of the current frame, a weighted overlap 
method is used on the current and the previous frame unvoiced samples using a standard 
synthesis window. Blocks 401,402 and 403 (Figure 4) are used to generate random phase 
values, to assign these phase values to the spectral ampHtudes and to take an inverse FFT to 
compute unvoiced speech samples for the current frame. The descriptive unvoiced speech 
synthesis method reduces the computational complexity by one-half (by reducing one FFT 
computation) with respect to the unvoiced speech synthesis algorithm used in INMARSAT M 
voice codec (Digital voice systems Inc. 1991, version 3.0 August 1991), without any 
degradation in output speech quality. 

Phase information plays a fimdamental role, especially in voiced and transition parts 
of speech segments. To maintain good quality speech, phase information must be based on a 
well-defined strategy or model. 

In the INMARSAT M voice codec (Digital voice systems Inc. 1991, version 3.0 
August 1991), phase initiahzation for each harmonic is performed in a specific manner in the 
decoder, Le, initial phases for the first one fourth of the total harmonics are linearly related 
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with the pitch frequency, while the remaining harmonics in the beginning of the first frame 
are initiahzed randomly and later updated continuously over successive frames to maintain 
harmonic continuity. 

The INMARSAT M voice codec phase initialization scheme is computationally 
intensive. Also, the output speech waveform is biased in an upward or downward direction 
along the axes. Consequently, chances of speech sample saturation are high, which leads to 
unwanted distortions in output speech. 

An illustrative phase initiahzation method that incorporates principles of the invention 
is described below. The illustrative phase initialization method is computationally simple 
with respect to the algorithm used in INMARSAT M voice codec (Digital voice systems Inc. 
1991, version 3.0 August 1991). 

In the illustrative phase initiahzation method, phases for each harmonic are initiahzed 
with a fixed set of values for each transition from completely unvoiced frames to voiced 
frames. These phases are later updated over successive voiced frames to maintain continuity. 
The initial phases are related to get a balanced output speech waveform. This output speech 
waveform is balanced on either side of the axis. 

The fixed set of phase values eliminate the chance of sample values getting saturated, 
and thereby remove unwanted distortions in the output speech. One set of phase values, 
which provide a balanced waveform, is listed below. These are values to which phases of the 
harmonics get initiahzed (Usted column-wise in increasing order of harmonic number) 
whenever there is a transition from an imvoiced frame to voiced frame. 
Harmonic phase values = { 

0.000000, -2.008388, -0.368968, -0.967567, 
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-2 . 077636, 


-1.009797, 


-0 . 129658, 


-0.903947, 


-0 .699374, 


-1.705878, 


0 .425315, 


-0.903947, 


-0.853920, 


-0.127823, 


-0 .897955, 


-0.903947, 


-1 .781785, 


-2 . 051089, 


0 .511909, 


-0.903947, 


-0 .588607, 


-1. 063303, 


-0 . 957640, 


-0 . 903947, 


-1,430010, 


-0. 009230, 


-2.185920, 


-0 . 903947, 


0 . 650081, 


-0 .490472, 


-0 . 631376, 


-0 . 903947, 


-0.4 1466 8 , 


-2 .307083 , 


-2 .315562, 


-0 .903947, 


-1.733431, 


-0.299851, 


-0 .901923, 


-0.903947, 


0 . 060934, 


-1 . 878630, 


-2 .362951, 


-0 . 903947, 


-1.085355, 


-0 .088243 , 


-0 . 926879, 


-0.903947, 


-1. 994504, 


-1.295832, 


0 .495461, 





} 
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The illustrative phase initialization method is computationally simpler with respect to the 
algorithm of the INMARSAT M voice codec (Digital voice systems Inc. 1991, version 3.0 
August 1991). The illustrative method also provides balanced output waveform, which 
eliminates the chance of unwanted output speech distortions due to saturation. The fixed set 
of phases also gives the decoded output speech a slightly smoother quality than that of the 
INMARSAT M voice codec (Digital voice systems Inc. 1991, version 3.0 August 1991), 
especially in voiced regions of speech. 

A different set of phase values that follow the same set pattern could also be used. 
This is within the scope of this invention. 

From the foregoing it will be observed that numerous modifications and variations 
can be effectuated without departing fi:om the true spirit and scope of the novel concepts 
of the invention. It is to be understood that no limitation with respect to the exemplary use 
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illustrated is intended or should be inferred. The disclosure is intended to cover by the 
appended claims all such modifications as fall within the scope of the claims. 
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CLAIMS 

What is claimed is: 

1 . A method for processing a signal, the method comprising the steps of: 
dividing the signal into frames, each frame having a corresponding spectrum; 
selecting a plurahty of pitch candidates from a first frame; 

selecting a plurality of pitch candidates from a second frame; 
calculating a cumulative error ftinction for a plurality of paths, each path 
including a pitch candidate from the first frame and a pitch candidate from the second frame; 
selecting a path corresponding to a low cumulative error fimction; 
basing a pitch estimate for a current frame on the selected path; 
using the pitch estimate for the current frame to process the signal. 

2. The method of claim 1, wherein the determining step fiirther comprises the 
step of synthesizing the signal, the synthesized signal having corresponding spectra, wherein 
the pitch candidates are obtained from the synthesized spectra. 

3. The method of claim 1 wherein the first frame is a previous frame and the 
second frame is a current frame. 

4. The method of claim 1 wherein the first frame is a current frame and the 
second frame is a ftiture frame. 

5. The method of claim 1 wherein the plurahty of pitch candidates for the first 
frame is no more than five pitch candidates and the plurality of pitch candidates for the 
second frame is no more than five pitch candidates. 

6. The method of claim 5 wherein a cumulative error fianction is calculated for 
all possible paths. 

7. The method of claim 1 wherein the selected pitch candidates for the first and 
second frames have low error fimctions. 
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8. The method of claim 7 wherein the error function is a measure of the spectral 
error between original and synthesized spectra. 

9. The method of claim 1 further comprising the step of selecting a plurality of 
pitch candidates for a third frame, and wherein each path further includes a pitch candidate 
from the third frame. 

10. The method of claim 9 wherein the plurality of pitch candidates for the first 
frame is no more than five pitch candidates, the plurality of pitch candidates for the second 
frame is no more than five pitch candidates and the plurality of pitch candidates for the third 
frame is no more than five pitch candidates. 

11. The method of claim 10 wherein a cumulative error function is calculated for 
all possible paths. 

12. The method of claim 9 wherein the first frame is a previous frame, the second 
frame is a current frame and the third frame is a fiiture frame, 

13. The method of claim 9 wherein the selected pitch candidates for the firsts 
second and third frames have low error fimctions. 

14. The method of claim 13 wherein the error fimction is a measure of the spectral 
error between original and synthesized spectra. 

15. The method of claim 14 wherein a cumulative error fimction for each path is 
defined by the equation: 

CF = k * (E_i + R2) + log (P.i / P.2) + k * (E.2 + E_3) + log(P_2 / P-3) 
wherein P_i is a selected pitch candidate for the first frame, P.2 is a selected pitch candidate 
for the second frame, P_3 is a selected pitch estimate for the third frame, E-i is an error for P.i, 
E-2 is an error for P,2, E_3 is an error for P_3, and k is a penaHsing factor. 
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16. The method of claim 9 wherein the basing a pitch estimate for a current frame 
on the selected path step further comprises calculating a backward pitch estimate along the 
selected path, wherein the pitch estimate for a current frame is based on the selected path and 
the backward pitch estimate. 

17. The method of claim 16 wherein the backward pitch estimate is calculated by 
calculating backward sub-multiples of a pitch candidate for the second frame in the selected 
path, determining whether the backward sub-multiples satisfy backward constraint equations, 
and selecting a low backward sub-multiple as the backward pitch estimate wherein the pitch 
candidate for the second frame in the selected path is selected as the backward pitch estimate 
if a backward sub-multiple does not satisfy the backward constraint equations. 

18. The method of claim 17 wherein the basing a pitch estimate for a current 
frame on the selected path step further includes determining a backward cumulative error 
based on the backward pitch estimate. 

1 9. The method of claim 1 8, wherein the backward cumulative error is defined by: 
CEb(Pb)-E(Pb) + E_KP.O 

wherein E(Pb) is an error of the backward pitch estimate and E.](P,i) is an error of the first 
pitch candidate. 

20. The method of claim 9 wherein the basing a pitch estimate for a current frame 
on the selected path step fiirther comprises calculating a forward pitch estimate along the 
selected path, wherein the pitch estimate for a current frame is based on the selected path and 
the forward pitch estimate. 

21. The method of claim 20 wherein the forward pitch estimate is calculated by 
calculating forward sub-multiples of a pitch candidate for the second frame in the selected 
path, determining whether the forward sub-multiples satisfy forward constraint equations, and 
selecting a low forward sub-multiple as the forward pitch estimate wherein the pitch 
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candidate for the second frame in the selected path is selected as the forward pitch estimate if 
a forward sub-multiple does not satisfy the forward constraint equations. 

22. The method of claim 21 wherein the forward constraint equation is selected 
from the group consisting of 

CEf(Po / n) < 0.85 and (CEf(Po / n)) / (CEf(Po)) < 1 .7; 
CEf(Po / n) < 0.4 and (CEf(Po / n)) / (CEf(Po)) < 3.5; and 
CEF(Po/n)<0.5 

where Po / n refers to forward sub-multiples, Po refers to the pitch candidate for the second 
frame in the selected path, and CEf(P) is an error function. 

23. The method of claim 21 wherein the basing a pitch estimate for a current 
frame on the selected path step further includes determining a forward cumulative error based 
on the forward pitch estimate, 

24. The method of claim 23, wherein the forward cumulative error is defined by: 
CEf(Pf) = E(Pf) + E_i(P_i) 

wherein E(Pf) is an error for the forward pitch estimate and E.i(P-i) is an error of the first 
pitch candidate. 

25. The method of claim 24 wherein the basing a pitch estimate for a current 
frame on the selected path step fiirther comprises calculating a backward pitch estimate along 
the selected path, wherein the backward pitch estimate is used to calculate a backward 
cumulative error, the pitch estimate being based on the selected path, the forward cumulative 
error and the backward cumulative error. 

26. The method of claim 25, wherein the basing a pitch estimate for a current 
frame on the selected path step fiirther comprises comparing the forward and backward 
ciomulative errors with one another, selecting the pitch estimate as the forward pitch estimate 
if the forward cumulative error is less than the backward cumulative error, and selecting the 
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pitch estimate as the backward pitch estimate if the backward cumulative error is less than the 
forward cumulative error 

27. The method of claim 20 wherein the basing a pitch estimate for a current 
frame on the selected path step further comprises calculating a backward pitch estimate along 
the selected path, wherein the pitch estimate for a current frame is based on the selected path, 
the forward pitch estimate and the backward pitch estimate. 

28. A method for processing a signal comprising the steps of: 
dividing the signal into frames; 

obtaining a pitch estimate for a current frame; 

refining the obtained pitch estimate comprising the sub-step of: 

computing backward and forward sub-multiples of the obtained pitch 

estimate for the current frame; 

determining whether the backward sub-multiples satisfy at least one 
backward constraint equation; 

determining whether the forward sub-multiples satisfy at least one 
forward constraint equation; 

selecting a low backward sub-multiple that satisfies the at least one 
backward constraint equation as the backward pitch estimate, wherein the obtained pitch 
estimate of the current frame is selected as the backward pitch estimate if a backward sub- 
multiple does not satisfy the at least one backward constraint equation; 

selecting a low forward sub-multiple that satisfies the at least one 
forward constraint equation as the forward pitch estimate, wherein the obtained pitch estimate 
of the current frame is selected as the forward pitch estimate if a forward sub-multiple does 
not satisfy the at least one forward constraint equation; 

using the backward pitch estimate to compute a backward cumulative 

error; 

using the forward pitch estimate to compute a forward cumulative 

error; 
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comparing the forward cumulative error to the backward cumulative 

error; 

refining the chosen pitch estimate for the current frame based on the 
comparison; and 

using the refined pitch estimate for the current frame to process the signal. 

29. A method for making voicing decisions for segments of a signal to process the 
signal, comprising the steps of: 

dividing the signal into frames, each frame having a corresponding spectrum; 
tracking a base noise energy level of previous frames; 
computing energy in a frame; 

calculating a ratio between the energy of the frame and the base noise energy 

level; 

comparing the ratio against a threshold value; and 

declaring the frame unvoiced if the ratio is less than the threshold value, 
wherein the frame is declared voiced if the ratio is greater than the threshold value; and 
using the declaration to process the signal. 

30. The method of claim 29, wherein the threshold value is derived from 
heuristics. 



3 1 . The method of claim 30, wherein the heuristics are obtained from testing a set 
of about 10,000 to about 15,000 frames with different background noise levels. 

32. The method of claim 29 further comprising the steps of: 

dividing a spectrum of a frame previously declared voiced into bands; 

comparing at least one band of the spectrum with at least one band of a voiced 
synthesized spectrum; 

comparing at least one band of the spectrum with at least one band of an 
unvoiced synthesized spectrum; 
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making a voiced/unvoiced decision for the at least one band based on the 
original-voiced comparison and the original-unvoiced comparison; and 

redeclaring the frame unvoiced if each band of the frame is marked unvoiced. 

33 . The method of claim 32, wherein each band contains about three harmonics. 

34. The method of claim 32, wherein the dividing the signal into frames step is 
based on a pitch frequency of the frame. 

35. The method of claim 32, further comprising the steps of: 
computing an unvoiced frame's energy; 

comparing the unvoiced frame's energy with an empirical threshold value; and 
redeclaring the unvoiced frame as silent if the frame has an energy level below 
the empirical threshold value. 

36. The method of claim 35, wherein a voicing parameter is determined, the 
voicing parameter being used to transmit or store the voiced/unvoiced/silence band 
information. 

37. A method to process a signal comprising the steps of: 

using a first technique to synthesize an original spectrum of the signal; 

using a second technique to synthesize the original spectrum; 

calculating a first error between a band of the original spectrum and a 
corresponding band of the first synthesized spectrum; 

calculating a second error between the band of the original spectrum and a 
corresponding band of the second synthesized spectrum; 

comparing the first error to the second error; 

declaring the band of the original spectrum as a first category band if the first 
error is less than the second error; 
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declaring the band of the original spectrum as a second category band if the 
second error is less than the first error; and 

using the declaration to process the signal. 

38. The method of claim 37, wherein the first technique comprises the step of 
ensuring that a valley amplitude between two successive harmonics of the first synthesized 
spectrum is not less than a valley amplitude between two corresponding successive 
harmonics of the original spectrum. 

39. The method of claim 38, wherein the first technique further comprises the step 
of clipping the synthesized spectrum valley amplitude to about a minimum value of the 
corresponding valley amplitude of the original spectrum. 

40. The method of claim 39, wherein the first technique further comprises the 
steps of placing a window around harmonic amplitudes to determine a pitch firequency. 

41. The method of claim 37, wherein the second technique comprises the step of 
fixing a pitch firequency and calculating a root mean square value over a region in the 
spectrum. 

42. The method of claim 37, wherein the error calculated between the band of the 
original spectrum and the corresponding band of the first synthesized spectrum is a mean 
square error, and the error calculated between the band of the original spectrum and the 
corresponding band of the second synthesized spectrum is a mean square error. 

43. The method of claim 37, wherein the original spectrum of the signal 
corresponds to a fi-ame marked voiced. 

44. The method of claim 43, wherein the first synthesizing technique is performed 
assuming a band of the spectrum is voiced, the second synthesizing technique is performed 
assuming a band of the spectrum is unvoiced, the band is declared voiced if the first error is 
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less than the second error, and the band is declared unvoiced if the second error is less than 
the first error, wherein the using a first and second technique to synthesize steps, the 
calculating steps, the comparing step, the declaring step, the using the declaration to process 
step are performed for each band of the spectrum, and a frame corresponding to the spectrum 
is declared unvoiced if each band within the frame is declared unvoiced, 

45. The method of claim 37, wherein the error of the first technique is defined by 
the equation: 

errorutO^) = [(Sorgim) - Ssynth(m)) * {Sorg{m) - Ssynth{m))] I N 
where k is the band number, Sorgij^ is the original spectrum, Ssymhip^ is the first synthesized 
spectrum, and N is a number of points used over a region to calculate a mean square error. 

46. The method of claim 37, wherein the error of the second technique is defined 
by the equation: 

error2nd(y) ^ [{Sorg{rn) - ^^^^m)) * {Sarg{rn) - Srms{rn))] I N 
where k is the band number, Sorg(j^ is the original spectrum, Srmsip^ is the second 
synthesized spectrum, and N is a number of points used over a region to calculate a mean 
square error. 

47. The method of claim 37, wherein the first synthesizing technique is performed 
assuming each band of the spectrum is voiced, the second synthesizing technique is 
performed assuming each band of the spectrum is unvoiced, the band is declared voiced and 
the voiced synthesis form is used if the first error is less than the second error, and the band is 
declared unvoiced and the unvoiced synthesis form is used if the second error is less than the 
first error; and fiarther comprising the steps of: 

determining a voicing parameter, wherein the voicing parameter denotes a 
band threshold and is transmitted or stored to convey voiced/unvoiced band information. 

48. The method of claim 47, wherein bands having properties below the voicing 
parameter are declared unvoiced and bands having properties above the voicing parameter are 
declared voiced. 
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49. The method of claim 47, wherein the voicing parameter is determined to 
minimize a hamming distance between a voicing bit string of the original band and a voicing 
bit string of the synthesized band. 

50. The method of claim 47, wherein the voicing parameter is weighted to 
compensate for voiced bands that are declared unvoiced and for unvoiced bands that are 
declared voiced by previous unvoiced/voiced band declarations. 

5 1 . The method of claim 47, wherein the following weighted bit error function is 
applied to the voicing parameter: 

k m 

s(k)-Cv i:(l-aO+ 2 ay 

wherein a/, z = 1, . . m are previous band declarations, Cv is a constant, and k is the harmonic 
number of the spectrum. 

52. A method of transmitting voicing decisions for bands of signal frames, 
comprising the steps of: 

determining a voicing parameter for which a distance between an original 
band voicing bit stream and a synthesized band voicing bit stream is minimized; 

declaring all bands having properties below the voicing parameter as imvoiced 
and all bands having properties above the voicing parameter as voiced; and, 

transmitting the voicing parameter. 

53. The method of claim 52 further comprising the steps of quantizing and 
encoding the voicing parameter. 

54. The method of claim 52, wherein the following weighted bit error function is 
applied to the voicing parameter: 

k m 

e(k) = Cv E(l-ai)+ Z a, 

(=1 j = V+\ 
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where a^, / ^ 1, m are previous voicing decisions, Cv is a constant, and k is the harmonic 
number of the spectrum. 

55. A method of transmitting or storing signal information, comprising the steps 

of: 

synthesizing a spectrum of a signal; 

modeling spectral amplitudes of the spectrum using a linear prediction 

technique; 

mapping linear prediction coefficients obtained from the linear prediction 
model to corresponding line spectral pairs; 

quantizing the line spectral pairs, wherein a residual of a previous quantizing 
stage is quantized during a current quantizing stage; 

storing or transmitting the quantized line spectral pairs. 

56. The method of claim 55 wherein multi-stage vector quantization is used 
during the quantizing step. 

57. The method of claim 55 further comprising the steps of: 

determining a voicing parameter, the voicing parameter conveying 
voiced/unvoiced band information; 

quantizing and encoding the voicing parameter; and 
storing or transmitting the quantized voicing parameter. 

58. A method for synthesizing bands of signal frames previously declared 
unvoiced, comprising the steps of: 

generating a random noise sequence; 

transforming values of the random noise sequence to random phase values; 

assigning the transformed random phase values to the spectral amplitudes to 
obtain a modified unvoiced spectrum; and 

taking an inverse Fourier transform of the modified unvoiced spectrum to 
obtain an unvoiced speech signal. 
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59. The method of claim 58, wherein the random noise sequence is defined by 
values generated from the equation: 

U(n + 1) = 171 * U{n) + 1 1213 - 53125 * L (171 U{n) + 11213) / 53125) J 
wherein L J represent the integer portion of a fractional number and [7(0) is initially set to 
3147. 

60. The method of claim 58, wherein the synthesizing step frirther includes the 
step of applying a weighted overlap add method to the unvoiced speech signal. 

61. The method of claim 58 wherein the transformed values are assigned to 
random phase values between about negative n and about positive n, 

62. The method of claim 58 wherein the transformed random phase values are 
assigned by applying the equation: 

Uw(m) = Samp(/) * (cos(^) +Jsm{^)) 
where / is a harmonic of the unvoiced spectrum and ^ is the random phase assigned to the /th 
harmonic of the unvoiced spectrum. 

63. The method of claim 58 wherein the unvoiced speech signal is obtained by 
applying the inverse Fourier transform equation: 

u(n) - 1/N * S U(/w)exp((/**2* ti* m * «)/N) 
where N/ 2 < n<{N/ 2)- 1, and N is the number of points used in the computation. 

64. A method for processing received data, comprising the steps of: 

decoding the received data to obtain signal information including unvoiced 
frame information, voiced frame information, and spectral envelope information; 

initializing phases of harmonics of the spectral envelope to a fixed set of 
values, wherein the initializing step is performed for transitions from imvoiced frames to 
voiced frames; and 
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65. The method of claim 64 wherein the initiahzed phases are related to get a 
balanced output speech waveform. 

66. The method of claim 65 wherein the fixed set of values is approximately 
defined by 



0.000000 


-1.781785 


-1.085355 


-0.009230 


-2.008388 


-0.588607 


-1.994504 


-0.490472 


-0 368968 


-1 430010 


-1.009797 


-2 307083 


-0.967567 


0.650081 


-1.705878 


-0.299851 


-2.077636 


-0.414668 


-0.127823 


-1.878630 


-0.699374 


-1.733431 


-2.051089 


-0.088243 


-0.853920 


0.060934 


-1.063303 


-1.295832 


-0.129658 


-0.631376 


-0.903947 


-0.903947 


0.425315 


-2.315562 


-0.903947 


-0.903947 


-0.897955 


-0.901923 


-0.903947 


-0.903947 


0.511909 


-2.362951 


-0.903947 


-0.903947 


-0.957640 


-0.926879 


-0.903947 


-0.903947 


-2.185920 


0.495461 


-0.903947 





wherein about 0.000000 is the first fixed value used and about -2.008388 is the second fixed 
value used so that each succeeding value is the next value used. 

67. An encoder for processing a signal comprising: 

a means for making a voicing decision for a frame of the signal; 
a means for determining a pitch value for fi-ames marked voiced; 
a means for basing an unvoiced-voiced decision for bands of frames marked 
voiced on two error fimctions, the first error function comprising a difference between a 
voiced synthesized spectrum and a spectrum of the signal and the second error function 
comprising a difference between an unvoiced synthesized spectrum and the spectrum of the 
signal; 

a means for synthesizing fi-ames marked unvoiced; 
a means for synthesizing fi-ames marked voiced; 
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a means for quantizing signal information. 

68. The encoder of claim 67 further comprising a means for storing signal 
information. 

69. The encoder of claim 67 further comprising a means for transmitting signal 
information. 

70. The encoder of claim 67 wherein the means for determining a pitch value 
comprises: (a) a means for selecting pitch candidates from a first fi^ame and a second frame, 
(b) a means for calculating an error for a plurality of paths, each path including a pitch 
candidate from the first fi-ame and a pitch candidate firom the second fi-ame, and (c) a means 
for basing a pitch value on a path having a low error. 

71. The encoder of claim 67 further comprising a means for refining the pitch 

value. 

72. The encoder of claim 67 wherein frames marked voiced are divided into bands 
based on a pitch value. 

73. The encoder of claim 67 further comprising a means for modeling spectral 
amplitudes of synthesized spectra, the means for modeling synthesized spectra using a linear 
prediction technique, and a means for converting linear prediction coefficients to line spectral 
pairs. 

74. The encoder of claim 67 fiirther comprising a means for calculating a voicing 
parameter, the voicing parameter being weighted to compensate for voiced bands marked 
unvoiced and for unvoiced bands marked voiced, 

75. The encoder of claim 74 wherein the means for determining a pitch value 
comprises: (a) a means for selecting pitch candidates from a first frame and a second frame, 
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(b) a means for calculating an error for a plurality of paths, each path including a pitch 
candidate from the first frame and a pitch candidate from the second frame, and (c) a means 
for basing a pitch value on a path having a low error; 

the means for basing an unvoiced-voiced decision for bands of frames 
marked voiced on two error function comprising a means for dividing frames marked voiced 
into bands based on the pitch value; and the encoder further comprises: 
a means for refining the pitch value; 

a means for modeling spectral amplitudes of synthesized spectra, the means 
for modeling synthesized spectra using a linear prediction technique; and 

a means for converting linear prediction coefficients to line spectral pairs. 

76. The encoder of claim 67 wherein the means for making a voicing decision, the 
means for determining a pitch value, the means for basing a unvoiced-voiced decision for 
bands of frames marked voiced on two error functions, the means for synthesizing frames 
marked voiced, the means for synthesizing frames marked unvoiced, and the means for 
quantising signal information is a speech coder or a speech vocoder. 

77, A decoder for processing received data, comprising: 
a means for decoding received data; 

a means for synthesizing frames having unvoiced band information to produce 
unvoiced speech, the means for synthesizing frames having unvoiced band information 
including a random number generator, means for transforming random values from the 
random number generator to random phase values, means for assigning the random phase 
values to spectral amplitudes of a spectral shape vector to form a modified unvoiced 
spectrum, and means for taking an inverse Fourier transform of the modified unvoiced 
spectrum; 

a means for synthesizing frames having voiced band information to produce 
voiced speech; and 

a means for combing voiced speech and unvoiced speech. 
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78. The decoder of claim 77, wherein unvoiced band information, voiced band 
information, line spectral pair information, and pitch value information is decoded, the 
decoder further including means for converting line spectral pair information to a spectral 
shape vector. 

79. The decoder of claim 77, further including a means for initiahzing phases of 
harmonics of unvoiced bands to a fixed set of values, wherein the initializations are 
performed for transitions from unvoiced frames to voiced frames. 
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ABSTRACT OF THE DISCLOSURE 
The invention relates to improving parameter estimation and speech synthesis. 
Pursuant to one aspect of the invention, a path of pitch candidates having low errors is 
tracked to determine a pitch estimate. Pursuant to another aspect of the invention, a number 
of parameters are used to classify speech segments. Pursuant to another aspect of the 
invention, a voicing parameter is determined using a threshold value and bands are marked 
voiced or unvoiced depending on two error functions that compare synthesized voiced and 
unvoiced spectra to an original speech spectrum. Pursuant to another aspect of the invention 
a voicing parameter is used to facilitate lower bits for transmitting voicing decisions. Last, 
pursuant to other aspects of the invention, unvoiced speech is synthesized by incorporating a 
random generator, and harmonics phases are initialized with a fixed set of values. 
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Docket No. 
7416/78599 



Declaration and Power of Attorney For Patent Application 

English Language Declaration 

As a below named inventor, I hereby declare that: 

My residence, post office address and citizenship are as stated below next to my name, 

I believe I am the original, first and sole inventor (if only one name is listed below) or an original, 

first and joint inventor (if plural names are listed below) of the subject matter which is claimed and for 

which a patent is sought on the invention entitled 

Modifications in the Multi-Band Excitation (MBE) Model for Generating High Quality Speech at Low Bit Rates 

the specification of which 

(check one) 

Kl Is attached hereto. 

□ was filed on as United States Application No. or PCT International 

Application Number 



and was amended on 



(if applicable) 

I hereby state that I have reviewed and understand the contents of the above identified specification, 
including the claims, as amended by any amendment referred to above. 

I acknowledge the duty to disclose to the United States Patent and Trademark Office all information 
known to me to be material to patentability as defined in Title 37, Code of Federal Regulations, 
Section 1.56, 

I hereby claim foreign priority benefits under Title 35, United States Code, Section 119(a)-(d) or 
Section 365(b) of any foreign applicatlon(s) for patent or inventor's certificate, or Section 365(a) of 
any PCT International application which designated at least one country other than the United States, 
listed below and have also identified below, by checking the box, any foreign application for patent or 
Inventor's certificate or PCT International application having a filing date before that of the application 
on which priority is claimed. 

Prior Foreign Application(s) Priority Not Claimed 

□ 

□ 
□ 



(Number) (Country) (Day/Month/Year Filed) 



(Number) ' (Country) (Day/MonthA'ear Filed) 



(Number) (Country) (Day/Month/Year Filed) 
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I hereby claim the benefit under 35 U.S.C. Section 119(e) of any United States provisional 
application(s) listed below: 



60/161,681 


October 26, 1999 


(Appiicaiion oenai ino.^ 


1 1^ 1 1 1 1 1 ^ L> a J 


(Application Serial No.) 


(Filing Date) 


(Application Serial No.) 


(Filing Date) 



I hereby claim the benefit under 35 U. S. C. Section 120 of any United States application(s), or 
Section 365(c) of any PCT International application designating the United States, listed below and, 
insofar as the subject matter of each of the claims of this application is not disclosed in the prior 
United States or PCT International application in the manner provided by the first paragraph of 35 
U.S.C. Section 112, I acknowledge the duty to disclose to the United States Patent and Trademark 
Q Office all information known to me to be material to patentability as defined in Title 37, C. F. R., 
^iS Section 1.56 which became available between the filing date of the prior application and the national 
^ or PCT International filing date of this application: 



S| (Application Serial No.) 


(Filing Date) 


(Status) 

(patented, pending, abandoned) 


f i (Application Serial No.) 


(Filing Date) 


(Status) 

(patented, pending, abandoned) 


r;i (Application Serial No.) 


(Filing Date) 


(Status) 

(patented, pending, abandoned) 



I hereby declare that all statements made herein of my own knowledge are true and that all 
statements made on information and belief are believed to be true; and further that these statements 
were made with the knowledge that willful false statements and the like so made are punishable by 
fine or imprisonment, or both, under Section 1001 of Title 18 of the United States Code and that such 
willful false statements may jeopardize the validity of the application or any patent issued thereon. 
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POWER OF ATTORNEY: As a named inventor, 1 hereby appoint the following attorney(s) and/or 
agent(s) to prosecute this application and transact all business in the Patent and Trademark Office 





connected therewith, (list name and registration number) 

A. Sidney Katz - 24,003 Thomas Tolpin - 27,600 
Richard Wood - 22,839 Jon Christensen, 34,137 
Jerold B. Schnayer - 28,903 Eric D, Cohen - 38,110 
Eric C. Cohen - 27,429 Walter J. Kawula, Jr., 39,724 
Joseph R. Marcus -25,060 Philip D. Segrest, Jr., 39,021 
Gerald S. Schur - 22,053 Mitchell J. Weinstein, 37,963 
Gerald T. Shekleton - 27,466 Maitreya P. Jani, 47,219 
James A. Scheer - 29,434 
Daniel R. Cherry - 29,054 
Ji^dward P. Cramson - 2V,Joi 
Kathleen A. Rheintgen - 34,044 






Send Correspondence to: ^^^^^ ^ L**^' 

JLXrU O. IVlVCl 31U.C X ImXiO. 

22nd Floor 

Chicago, Illinois 60606 
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Direct Telephone Calls to: (name and telephone number) 
Walter J. Kawula, Jr., 312/655-1500 












Full name of sole or first inventor 
Manoj Kumar Singhal 




Sole or first inventor's signature 


Date 




Residence 
India 




Citizenship 
Indian 




Post Office Address 

2337, 2nd "A" Cross, 16th "B" Main, Indira Nagar, HAL 2nd Stage, Bangalore, 560 075 India 







Full name of second inventor, If any 






Sangeetha 






Second inventor's signature 


Date 




Residence 






India 






Citizenship 






Indian 






Post Office Address 






1107, 26th "B" Main, 9th Block, Jayanagar, Bangalore, India 
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Full name of third inventor, if any 




Puranjoy Bhattacharya 




Third Inventor's signature 


Date 


Residence 




India 




Citizenship 




Indian 




Post Office Address 




203, 8th Main, Mathikere Extension, Bangalore, 560 054, India 







Full name of fourth inventor, if any 



Fourth inventor's signature 



Residence 



Citizenship 



Post Office Address 



Full name of fifth inventor, if any 



Fifth inventor's signature 



Residence 



Citizenship 



Post Office Address 



Full name of sixth inventor, if any 



Sixth inventor's signature 



Residence 



Citizenship 



Post Office Address 
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